Abstract

We investigate the problem of determining a set $S$ of $k$ indistinguishable integers in
the range $[1, n]$. The algorithm is allowed to query an integer $q \in [1, n]$, and receive a
response comparing this integer to an integer randomly chosen from $S$. The algorithm
has no control over which element of $S$ the query $q$ is compared to. We show tight
bounds for this problem. In particular, we show that in the natural regime where $k \le n$,
the optimal number of queries to attain $n^{-\Omega(1)}$ error probability is $\Theta(k^3 \log n)$. In the
regime where $k > n$, the optimal number of queries is $\Theta(n k^2 \log n)$.

Our main technical tools include the use of information theory to derive the lower
bounds, and the application of noisy binary search in the spirit of Feige, Raghavan,
Peleg, and Upfal (1994). In particular, our lower bound technique is likely to be
applicable in other situations that involve search under uncertainty.
1 Introduction



This paper investigates the problem of identifying a set S of indistinguishable items by 
repeated queries where we know the range of values the items can take. At every query, we 
gain information based on our query and some random item from the set S we are trying to 
find (we do not know which item was chosen). The overall simple statement of the problem 
makes it widely generalizable. The query can be thought of as an experiment in which we 
apply a measurement on an element of $S$ without knowing which element has been measured.
The set of items can refer to a set of DNA strands in a "soup" of DNAs, passwords or any 
item that we might be interested in finding when we know what possible values the item may 
take. The queries can be viewed as tests on DNA strands, attempts at guessing a password or 
any trial we may run that will provide some information about one of the items in question. 
The specific problem we investigate is where the items are integers. Our queries are guesses 
of integers which return the result of a comparison with a chosen integer from the set we are 
trying to find. 

As far as we know, this problem has not been investigated in the literature. However, it 
falls into the rich class of noisy search problems. Since we do not know which number was 
chosen when we query a number, we have to deal with a lack of information in trying to 
determine the set of numbers. Due to this missing information, it is not immediately obvious 
that there exists a solution to the problem. 

In this paper we give asymptotically tight upper and lower bounds for the number of 
queries needed to find a set S of size k of numbers from {1, . . . , n}, where the queries are 
comparison queries. 

We briefly discuss similar problems that have been previously studied. Feige et al.
explored the depth of noisy decision trees, where each node can be wrong with some constant 
probability, in [3]. One of the problems they investigated is binary search where the result of 
each query is wrong with a constant probability. They presented an algorithm to solve this 
with running time $\Theta(\log\frac{n}{Q})$ where $n$ is the input set size and $Q$ is the probability of error of
the algorithm. The algorithm we present uses a similar technique to the one used for noisy 
binary search in [3]. 

The Renyi-Ulam game is also a related problem. In one variation of this game, we need 
to discover a chosen integer. To do this, we query a number and are told whether the number 
we are trying to find is greater than the number we guessed or not. However, some constant 
number of lies are allowed. In [10], one lie is allowed, which means that one of the responses 
to our queries can be false. Similarly, Pelc discussed in [7] an algorithm for performing 
the search when one lie is allowed and concluded that the original question posed by Ulam 
(finding an integer between one and a million with one lie allowed) requires 25 queries. In 
[10] and other papers that explore the Renyi-Ulam game, some restriction is placed on
the pattern of queries with false results. Ravikumar and Lakshmanan discussed such patterns
(and why they are necessary to make the problem solvable) in [9].

Another related problem is sorting from noisy information. Braverman and Mossel 
investigated this in pQ. The problem of sorting from noisy information is similar to our 
problem because in noisy sorting we can make comparisons between the items that need to 
be sorted, but each comparison may give us false information. This has applications, for
example, in ranking sports teams where the comparisons are games between teams (one team
wins) but the comparisons are noisy because the better team (which should have a higher
rank) does not always win. Klein et al. also investigated this problem in [5]. Apart from
noisy sorting, they applied the same model to explore other problems, such as finding the 
maximum of n numbers. 

The problem we are investigating is motivated by applications that involve a search 
for several items by repeated queries where we do not know which item was chosen to be 
compared with our query (i.e. the items are indistinguishable). One interpretation is where 
the items represent DNA strands in a mixture that we are trying to identify. We can perform 
tests that give us some information about one of the DNA strands in the mixture, but we 
do not know which one. Similarly, instead of trying to identify DNA strands, we might be 
trying to identify passwords where our queries give us some partial information about one 
password out of several that a particular user often uses (and switches between). 

We note that the applications mentioned do not take exactly the same form as the problem we
explore. The items in our problem are integers and the queries are guesses of an integer that 
result in the response 'less than or equal to' or 'greater than'. In generalizing the problem 
to other applications, the form of items or queries may change. For example, the queries 
in the DNA mixture example may describe a property of a particular nucleotide instead of 
returning one of two possible answers. Therefore, the algorithm will have to be changed. 
However, a similar framework can be used which allows information to be gained despite the 
uncertainty regarding query responses due to the indistinguishability of the items. A solution 
to the problem we have posed can lead to the development of new methods for identifying a 
set of items where we know these items can only take on a certain range of values. On the 
lower-bound side, our results show that information-theoretic quantities are very effective at 
measuring and upper-bounding information learned from queries, even when such information 
is only a fraction of one bit. We believe that the information-theoretic lower bound technique 
will generalize to tight lower bounds in other settings. 

We now discuss the results and structure of the paper. In Section 2 we formally introduce
the problem we are solving with the restriction that the number of chosen integers is
significantly smaller than the range of integers available. We prove a lower bound for the
problem in Section 3.1 using information theoretic techniques. This involves constructing the
hard instances where we split the possible values the chosen integers can take into consecutive
clusters of equal size and place one chosen integer in each such cluster. Intuitively, this forces
the search algorithm to find the elements one at a time, which turns out to be costly due to
the fact that we don't control the sample. To formalize this intuition, we calculate the entropy
of the random variable representing a particular chosen integer (it may take values of the
integers in one of the clusters described above). We then use the mutual information of this
random variable and the random variable representing the responses to the queries we make
to find the minimum number of queries required to find that chosen integer. After showing
that the same minimum number of queries applies to at least half of the chosen integers, we
reach a lower bound of $\Omega(k^3 \log n)$, where $k$ is the size of the set $S$ and the elements of $S$
take integer values between 1 and $n$ (inclusive). Further, this bound extends to all $k \le n$,
using a slightly different set of hard instances. When $k > n$ we obtain a lower bound of
$\Omega(k^2 n \log n)$. In Section 4 we present an optimal algorithm for solving the problem, proving
both its correctness and worst case running time of $O(k^3 \log\frac{n}{\delta})$ where $\delta$ is the probability of
error. This shows that the lower bound is tight. Moreover, while the lower bound applies
to finding $S$ even with a constant error probability, we see that the upper bound remains
asymptotically the same even if we set the error $\delta = n^{-O(1)}$ to be polynomially small.

Our results show that the problem we describe can be solved in practice when the items 
we are searching for can take a large number of values. This is because the dependence of 
the running time on $n$ grows as $\log n$. However, the number of items in $S$ needs to remain
small because the dependence of the running time on $k$ grows as $k^3$.

2 Problem definition 

We consider a (multi-)set $S$ of $k$ integers $X_1, \dots, X_k$, where each $X_i \in \{1, 2, \dots, n\}$ for $1 \le i \le k$.
Our goal is to discover the set $S$. The process is to repeat the following three steps:

1. Query an integer $Y \in \{1, 2, \dots, n\}$.

2. An integer $X_i$ is selected from $S$ uniformly at random.

3. We are told whether $X_i \le Y$ or $X_i > Y$.

These three steps are repeated until we know what the k integers in S are. Our goal is to find 
the most efficient algorithm for determining S. Our model of computation is that queries are 
the costly operations. Therefore, by finding the most efficient algorithm we mean finding the 
algorithm that minimizes the number of queries made. We refer to this as 'the problem' we 
are solving. Furthermore, for brevity, we refer to the two possible responses to queries as '<'
($X_i \le Y$) and '>' ($X_i > Y$) and the $k$ integers in $S$ as 'the chosen integers'.
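The query model above can be simulated in a few lines. The following is only an illustrative sketch: the class name `ComparisonOracle`, the `seed` parameter and the `queries` counter are assumptions introduced here, not part of the paper.

```python
import random


class ComparisonOracle:
    """Hypothetical helper that answers comparison queries against a hidden multiset S."""

    def __init__(self, hidden, n, seed=None):
        if not hidden or any(x < 1 or x > n for x in hidden):
            raise ValueError("hidden must be a non-empty list of integers in [1, n]")
        self.hidden = list(hidden)      # the multiset S, unknown to the searcher
        self.n = n
        self.queries = 0                # number of queries made so far (the cost)
        self._rng = random.Random(seed)

    def query(self, y):
        """Return '<' if the sampled X_i satisfies X_i <= y, and '>' otherwise."""
        if not 1 <= y <= self.n:
            raise ValueError("query must lie in [1, n]")
        self.queries += 1
        x = self._rng.choice(self.hidden)   # step 2: uniform choice from S
        return '<' if x <= y else '>'       # step 3: result of the comparison
```

With `S = [3, 10]` and `n = 16`, a query of 16 always returns '<' and a query of 2 always returns '>', while a query of 8 returns either answer with probability 1/2.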

In this paper we give a complete characterization of the query complexity of this problem. 
Note that since the Xj is selected at random from S, we cannot hope for a deterministic 
algorithm, and have to settle for a probabilistic performance guarantee. We focus on the 
regime where we are required to output the correct set S except with some (possibly constant) 
probability $\delta$. The answer can be broken down into three main regimes, which will be
discussed in the analysis: (1) $k \ll n$, e.g. $k < \sqrt{n}$; (2) $\sqrt{n} \le k \le n$; and (3) $k > n$. The
answer is given by the following main theorem:

Theorem 1. The number of queries needed to determine a multi-set $S \subseteq [n]$ of size $k$ with a
given error $n^{-O(1)} < \delta < 1/4$ is $\Theta(k^3 \log n)$ when $k \le n$, and $\Theta(k^2 n \log n)$ when $k > n$.

Note that the distinction between $k < \sqrt{n}$ and $\sqrt{n} \le k \le n$ only comes up in the analysis,
but (asymptotically) makes no difference in the result.

Remark 2. Because of the way the algorithms work, Theorem 1 remains true even if the
comparisons in the query answers are themselves noisy, and output the correct value of
$X_i \overset{?}{>} Y$ only with probability $1/2 + \gamma$ for some constant $\gamma > 0$.
Remark 3. Somewhat surprisingly, the same bounds hold for a fairly broad range of error
parameters. In particular, the lower bound holds even when the error is constant, while the
upper bound holds even for polynomially small errors (the constant in the $\Theta(\cdot)$ may depend
on the constant $\beta$ in $\delta = n^{-\beta}$).

3 The lower bounds 

We begin with showing the lower bound. In fact, we break the lower bound into two regimes:
$k < \sqrt{n}$ and $k \ge \sqrt{n}$. In the former regime, we use information-theoretic techniques to show
the lower bound. In the latter, we give a more straightforward proof of the $\Omega(k^3 \log k)$ lower
bound when $k \le n$, and $\Omega(k^2 n \log n)$ when $k > n$. The $\Omega(k^3 \log k)$ lower bound is weaker in
general than $\Omega(k^3 \log n)$ when $k \le n$, but is equivalent in the regime where $k \ge \sqrt{n}$.

3.1 The case $k < \sqrt{n}$: an information-theoretic lower bound

The main technical ingredients in the lower bound proof are the Kullback-Leibler divergence
and mutual information. We first introduce these terms and the lemmas we will use. For a
more thorough introduction to these, see [2].

The Kullback-Leibler divergence (KL-divergence) measures the difference between two
probability distributions:

Definition 4. For discrete random variables $P$ and $Q$ over the same sample space, the KL-divergence
is defined as:

$$D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)},$$

with the convention that the term in the sum is interpreted as $0$ when $P(i) = 0$ and $+\infty$
when $P(i) > 0$ and $Q(i) = 0$.

We also use mutual information, which we define and arrange into a form we will use:

Definition 5. Mutual information is a measure of the correlation between two random
variables. The more independent the variables are, the lower the mutual information is.

$$I(X;Y) = D_{KL}(p(x,y) \,\|\, p(x)p(y))$$

Before we rearrange this definition into a form we will use, we first note (from [2]) that it
can also be written in terms of the more familiar Shannon entropy as:

$$I(X;Y) = H(X) - H(X|Y).$$

Since $H(X) \ge H(X|Y)$, $I(X;Y) \ge 0$. If entropy is interpreted as the uncertainty regarding
a probability distribution, we see that the mutual information between $X$ and $Y$ represents
the reduction in uncertainty of $X$ by knowing $Y$.
We now return to the original definition given for mutual information. Using the definition
of the KL-divergence and conditional probability ($p(x|y) = \frac{p(x,y)}{p(y)}$), we have

$$\begin{aligned}
I(X;Y) &= \sum_y p(y) \sum_x p(x|y) \log \frac{p(x|y)}{p(x)} \\
&= \sum_y p(y)\, D_{KL}(p(x|y) \,\|\, p(x)) \\
&= E_Y\left[D_{KL}(p(x|y) \,\|\, p(x))\right]
\end{aligned}$$

Thus we see that the mutual information is the expectation of the KL-divergence between the 
probability distribution of X and the probability distribution of X conditioned on Y. If these 
two distributions have a high KL-divergence, then knowing Y provides us a high amount of 
information regarding the probability distribution of X. This is equivalent to saying that the 
mutual information of X and Y is high. 
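The identity between the two forms of mutual information can be checked numerically on a small joint distribution. The sketch below is only an illustration; the function names and the representation of the joint distribution as a nested list `joint[x][y]` are assumptions made here.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue              # convention: the term is 0 when P(i) = 0
        if q_i == 0:
            return math.inf       # convention: +infinity when P(i) > 0 and Q(i) = 0
        total += p_i * math.log2(p_i / q_i)
    return total


def marginals(joint):
    """Return (p(x), p(y)) for a joint distribution joint[x][y] = p(x, y)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y


def mutual_information(joint):
    """I(X;Y) computed as D_KL(p(x,y) || p(x)p(y))."""
    p_x, p_y = marginals(joint)
    pairs = [(x, y) for x in range(len(p_x)) for y in range(len(p_y))]
    flat_joint = [joint[x][y] for x, y in pairs]
    flat_product = [p_x[x] * p_y[y] for x, y in pairs]
    return kl_divergence(flat_joint, flat_product)


def expected_conditional_kl(joint):
    """I(X;Y) computed as E_Y[D_KL(p(x|y) || p(x))]."""
    p_x, p_y = marginals(joint)
    total = 0.0
    for y, p_of_y in enumerate(p_y):
        if p_of_y == 0:
            continue
        conditional = [joint[x][y] / p_of_y for x in range(len(p_x))]
        total += p_of_y * kl_divergence(conditional, p_x)
    return total
```

For a correlated joint distribution such as `[[0.4, 0.1], [0.1, 0.4]]` the two functions agree up to floating-point error, and for an independent one both are zero.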

We will use the chain rule for mutual information: 

Lemma 6. $I(X; Y_1, Y_2, \dots, Y_n) = I(X; Y_1) + I(X; Y_2 | Y_1) + \dots + I(X; Y_n | Y_{n-1}, \dots, Y_2, Y_1)$

For a proof of the above lemma, see [2]. We are now done defining the information
theory terms we will need. Lastly, we will need the following lemma which describes the
KL-divergence between two Bernoulli random variables with a similar probability of success:

Lemma 7. $D_{KL}(B_{p \pm \varepsilon} \| B_p) = O(\varepsilon^2)$ where $B_p$ is a Bernoulli random variable with probability
of success $p$, $\frac{1}{4} \le p \le \frac{3}{4}$ and $\varepsilon \le \frac{1}{8}$.

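Proof. Here we prove the plus part of the lemma ($D_{KL}(B_{p+\varepsilon} \| B_p) = O(\varepsilon^2)$). The minus part
is nearly identical and is thus excluded.

$$\begin{aligned}
D_{KL}(B_{p+\varepsilon} \| B_p) &= (p+\varepsilon) \log \frac{p+\varepsilon}{p} + (1-p-\varepsilon) \log \frac{1-p-\varepsilon}{1-p} \\
&= \log\left[\left(1 + \frac{\varepsilon}{p}\right)^{p} \left(1 - \frac{\varepsilon}{1-p}\right)^{1-p}\right] + \varepsilon \log\left(1 + \frac{\varepsilon}{p(1-p-\varepsilon)}\right)
\end{aligned}$$

Use the inequalities $1 + x \le e^{x}$ and $1 - x \le e^{-x}$:

$$\begin{aligned}
D_{KL}(B_{p+\varepsilon} \| B_p) &\le \log\left(e^{\varepsilon} e^{-\varepsilon}\right) + \varepsilon \log e^{\frac{\varepsilon}{p(1-p-\varepsilon)}} \\
&= \log_2 e^{0} + \frac{\varepsilon^2}{\ln 2 \cdot p(1-p-\varepsilon)}
\end{aligned}$$

Since $\frac{1}{4} \le p \le \frac{3}{4}$ and $\varepsilon \le \frac{1}{8}$, $p(1-p-\varepsilon) \ge \frac{1}{4}\left(1 - \frac{3}{4} - \frac{1}{8}\right) = \frac{1}{32}$, so

$$D_{KL}(B_{p+\varepsilon} \| B_p) \le \frac{32 \varepsilon^2}{\ln 2} = O(\varepsilon^2). \qquad \square$$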
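As a sanity check on Lemma 7, the bound $32\varepsilon^2/\ln 2$ from the proof can be compared numerically against the exact KL-divergence over a grid of admissible $p$ and $\varepsilon$. This is a hedged sketch for the plus part only; the function names and the grid resolution are choices made here, and a grid check is of course not a proof.

```python
import math


def bernoulli_kl(a, b):
    """D_KL(B_a || B_b) in bits, for success probabilities 0 < a, b < 1."""
    return a * math.log2(a / b) + (1 - a) * math.log2((1 - a) / (1 - b))


def lemma7_bound(eps):
    """The explicit bound 32 * eps^2 / ln 2 obtained in the proof of Lemma 7."""
    return 32 * eps ** 2 / math.log(2)


def check_lemma7(steps=50):
    """Check D_KL(B_{p+eps} || B_p) <= 32 eps^2 / ln 2 on a grid.

    The grid covers 1/4 <= p <= 3/4 and 0 < eps <= 1/8, the ranges in the lemma.
    """
    for i in range(steps + 1):
        p = 0.25 + 0.5 * i / steps
        for j in range(1, steps + 1):
            eps = 0.125 * j / steps
            if bernoulli_kl(p + eps, p) > lemma7_bound(eps):
                return False
    return True
```

The explicit constant is far from tight (the divergence is roughly $\varepsilon^2 / (2 \ln 2 \cdot p(1-p))$ for small $\varepsilon$), but only the $O(\varepsilon^2)$ order matters for the lower bound.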

We are now ready to begin our proof of the lower bound. The approach taken is to show 
that the information gain from each query is small compared with the total information 
required to find a certain chosen integer. This will allow us to show that a certain minimum 
number of queries is required to find each of the k integers. 

Lemma 8. The lower bound for the number of queries required to find the $k$ integers between
1 and $n$ in the set $S$ with probability $\ge 0.99$, when $8 \le k \le \sqrt{n}$, is $\Omega(k^3 \log n)$.

Proof. We choose our input as follows. Split the integers in the range $[1, n]$ into $k$ equally
sized clusters. Call these clusters $G_1, G_2, \dots, G_k$. Let there be one of the $k$ chosen integers in
each such cluster. This integer is chosen uniformly at random from the integers in the cluster.
Note that the number of integers in each cluster is $\frac{n}{k}$, which, without loss of generality, we
will assume is an integer. See Figure 1 for a visualization of this.

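[Figure: the integers $1, 2, \dots, n$ split into consecutive clusters $G_1 = \{1, \dots, \frac{n}{k}\}$, $G_2 = \{\frac{n}{k}+1, \dots, \frac{2n}{k}\}$, ..., $G_k$]

Figure 1: Visualization of our partition of the integers between 1 and n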

We consider individually a cluster $G_i$ where $\frac{k+4}{4} \le i \le \frac{3k}{4}$. Let $L$ be the random variable
that represents the chosen integer in $G_i$. Since this number is chosen uniformly at random
from $\frac{n}{k}$ elements, the probability of each integer being the chosen integer is $P(x) = \frac{1}{n/k} = \frac{k}{n}$.
Therefore, the entropy of $L$ is $H(L) = \sum_x P(x) \log \frac{1}{P(x)} = \sum_{x=1}^{n/k} \frac{k}{n} \log \frac{n}{k} = \log \frac{n}{k}$. We now
define $Q_j$ to be a Bernoulli random variable representing the response to the $j$-th query (i.e.
either '<' or '>'). We need to make enough queries so that the information gain relevant to $L$
is close to the entropy of $L$ in order to determine the chosen number in $G_i$ with a high degree
of accuracy. This is equivalent to saying that the mutual information between $L$ and the
queries made $Q_1, Q_2, \dots, Q_l$ is at least a constant times the entropy of $L$. Indeed, in the end,
we must have determined the point with probability greater than 0.99. Therefore, conditioned
on the queries, most of the mass is concentrated on one point and $H(L | Q_1, \dots, Q_l) < 0.2 \log \frac{n}{k}$.
Therefore, $I(L; Q_1, \dots, Q_l) = H(L) - H(L | Q_1, \dots, Q_l) = \Omega(\log \frac{n}{k})$. Thus, we need:

$$I(L; Q_1, Q_2, \dots, Q_l) \ge \Omega\left(\log \frac{n}{k}\right), \qquad (1)$$

where $l$ is the number of queries made. We want to find the minimum $l$ for which this is true.
First, we use Lemma 6 (chain rule) to write:

$$I(L; Q_1, Q_2, \dots, Q_l) = I(L; Q_1) + I(L; Q_2 | Q_1) + \dots + I(L; Q_l | Q_{l-1}, \dots, Q_2, Q_1). \qquad (2)$$

Take one of these terms and recall that we can express mutual information in terms of
KL-divergence:

$$I(L; Q_j | Q_{j-1}, \dots, Q_1) = E\left[D_{KL}\left(p(Q_j | L, Q_{j-1}, \dots, Q_1) \,\|\, p(Q_j | Q_{j-1}, \dots, Q_1)\right)\right]$$

where $1 \le j \le l$. Thus, we need to find the KL-divergence of $Q_j | L, Q_{j-1}, \dots, Q_1$ and of
$Q_j | Q_{j-1}, \dots, Q_1$. We note that since we chose cluster $G_i$, there are $i - 1$ of the $k$ chosen
integers that are smaller and $k - i$ of the $k$ numbers that are bigger than any element
of $G_i$. Therefore, for both probability distributions, the probability that the response is
'<' is at least $\frac{i-1}{k}$ and the probability that the response is '>' is at least $\frac{k-i}{k}$. Therefore,
both probability distributions are Bernoulli with probability of success (taking success to
be the response '<') between $\frac{i-1}{k}$ and $1 - \frac{k-i}{k} = \frac{i}{k}$. Thus, the difference in probabilities of
success of the two distributions is at most $\frac{i}{k} - \frac{i-1}{k} = \frac{1}{k}$. Then if we let $Q_j | L, Q_{j-1}, \dots, Q_1$
be $B_p$ and let $Q_j | Q_{j-1}, \dots, Q_1$ be $B_{p \pm \varepsilon}$, we know $\frac{1}{4} \le p \le \frac{3}{4}$ (because $\frac{k+4}{4} \le i \le \frac{3k}{4}$) and
$0 \le \varepsilon \le \frac{1}{k}$ (because this is the maximum difference in probability of success between the
two distributions). By Lemma 7, $D_{KL}(p(Q_j | L, Q_{j-1}, \dots, Q_1) \,\|\, p(Q_j | Q_{j-1}, \dots, Q_1)) = O(\varepsilon^2) = O(\frac{1}{k^2})$.
So $E[D_{KL}(p(Q_j | L, Q_{j-1}, \dots, Q_1) \,\|\, p(Q_j | Q_{j-1}, \dots, Q_1))] = O(\frac{1}{k^2})$ and we have:

$$I(L; Q_j | Q_{j-1}, \dots, Q_1) = O\left(\frac{1}{k^2}\right)$$

Returning to equation (2):

$$I(L; Q_1, Q_2, \dots, Q_l) = \sum_{j=1}^{l} I(L; Q_j | Q_{j-1}, \dots, Q_1) = O\left(\frac{l}{k^2}\right)$$

From (1), we have $O(\frac{l}{k^2}) \ge \Omega(\log \frac{n}{k})$ so

$$l = \Omega\left(k^2 \log \frac{n}{k}\right) = \Omega(k^2 \log n)$$

since $k \le \sqrt{n}$. This is the minimum number of queries to find the chosen integer in $G_i$. This
holds in total for $\frac{3k}{4} - \frac{k+4}{4} + 1 = \frac{k}{2}$ of the $k$ chosen numbers (this is the number of clusters $G_i$
with $i$ in the range we considered). Note that to find the chosen number in $G_i$, queries made
in determining the number within $G_j$ with $j \ne i$ provide no information for determining
the number in $G_i$ (as all queries are either bigger or smaller than all the numbers in $G_i$).
Then finding $\frac{k}{2}$ of the $k$ chosen numbers requires at least $\Omega(\frac{k}{2} k^2 \log n) = \Omega(k^3 \log n)$ queries.
Therefore, finding all $k$ of the chosen numbers requires at least $\Omega(k^3 \log n)$ queries. □
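The hard instances used in the proof of Lemma 8 are easy to generate, which can be handy when experimenting with search algorithms. The following is a minimal sketch; the function names and the `seed` parameter are assumptions made here, and $k$ is assumed to divide $n$ (and 4 to divide $k$), as in the proof.

```python
import random


def sample_hard_instance(n, k, seed=None):
    """Draw one integer uniformly at random from each of k consecutive clusters.

    Cluster G_i (1-indexed) holds the integers (i-1)*n/k + 1, ..., i*n/k.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes that k divides n")
    rng = random.Random(seed)
    size = n // k
    return [rng.randint((i - 1) * size + 1, i * size) for i in range(1, k + 1)]


def middle_clusters(k):
    """Cluster indices i with (k+4)/4 <= i <= 3k/4 (assumes 4 divides k)."""
    return list(range((k + 4) // 4, 3 * k // 4 + 1))
```

For example, `middle_clusters(8)` returns `[3, 4, 5, 6]`, which are the $k/2 = 4$ clusters to which the per-cluster bound is applied.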



3.2 The lower bound when $k \ge \sqrt{n}$



Next we turn our attention to the lower bound in the regime where $k \ge \sqrt{n}$. We start with
the case $\sqrt{n} \le k \le n - 2$, as the case $k > n - 2$ is treated very similarly. The multi-set $S$
is constructed as follows: we place $k/4$ 1's and $k/4$ $n$'s in $S$. Partition the rest of the set
$\{1, \dots, n\}$ into bins $B_1 = \{2, 3\}$, $B_2 = \{4, 5\}$, etc. For each bin $B_i$ for $i = 1, 2, \dots, k/2$, we
place exactly one of the elements of $B_i$ in $S$ independently and uniformly at random. We
now look at the process of determining which element of $B_i$ has been selected using the
queries. Note that only the query with $Y = 2i$ carries any information on which element of
$B_i$ has been selected. Thus a set of observations can be specified by a set of pairs of numbers
$\{(l_i, h_i)\}_{i=1}^{k/2}$ where $l_i$ represents the number of times we queried $Y = 2i$ and received the '<'
answer, and $h_i$ represents the number of times we received the '>' answer. The probability
of each answer is between 1/4 and 3/4, and varies by $1/k$ depending on whether we selected
$2i$ or $2i + 1$ in $B_i$.

When we output the set $S$, we need to make $k/2$ decisions of whether to output $2i$ or
$2i + 1$ for each $B_i$. Each of these decisions should depend only on the values of $(l_i, h_i)$,
and should maximize the probability that the output is correct. This can only be done by
outputting the maximum likelihood value for each $B_i$. More precisely, we should output $2i$ if
$\frac{l_i}{l_i + h_i} > \frac{k/4 + i - 1/2}{k}$, and $2i + 1$ otherwise. We are not particularly concerned with these details,
but only with the probability that our output is wrong. Denote by $\varepsilon_i$ the probability
that the maximum-likelihood output given $(l_i, h_i)$ is incorrect. We first claim that to have a
probability of $> 0.5$ to be correct in outputting $S$, we must have a bound on the sum of the
$\varepsilon_i$'s.

Claim 9. If given the values $\{(l_i, h_i)\}_{i=1}^{k/2}$ the output $S$ is correct with probability $> 0.5$, then
$\sum_{i=1}^{k/2} \varepsilon_i < 1$.

Proof. Since the events of being correct on each $B_i$ are independent, the probability of being
correct on all $B_i$'s is given by

$$0.5 < \prod_{i=1}^{k/2} (1 - \varepsilon_i) \le e^{-\sum_{i=1}^{k/2} \varepsilon_i},$$

which implies the statement of the claim. □

Next, let us denote by $\mu_i$ the a-priori expected number of '<' responses on $l_i + h_i$ queries,
and let $d_i := |l_i - \mu_i|$ be the observed deviation from this expected value. Intuitively, the
greater this deviation, the greater is our confidence in the answer. In fact, it is not hard to
formalize this intuition:
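Claim 10. For each $i$, and $k \ge 25$, $\varepsilon_i \ge e^{-10 d_i/k}/8$.

Proof. Suppose wlog that $l_i \ge \mu_i$, and thus we are outputting $2i$. Denote by $p$ the probability
of a '<' response when $2i + 1$ is selected, and by $q$ the probability of a '>' response when $2i$ is
selected, so that the corresponding probabilities under the other selection are $p + 1/k$ and
$q + 1/k$, and $p, q \ge 1/4$. We have by Bayes' rule

$$\begin{aligned}
\varepsilon_i &\ge \frac{\Pr[(l_i, h_i) \mid 2i+1]}{2 \Pr[(l_i, h_i) \mid 2i]} = \frac{p^{l_i} (q + 1/k)^{h_i}}{2 (p + 1/k)^{l_i} q^{h_i}} \\
&= \frac{p^{\mu_i - 1} (q + 1/k)^{l_i + h_i - \mu_i + 1}}{2 (p + 1/k)^{\mu_i - 1} q^{l_i + h_i - \mu_i + 1}} \cdot \frac{p^{l_i - \mu_i + 1} (q + 1/k)^{\mu_i - l_i - 1}}{(p + 1/k)^{l_i - \mu_i + 1} q^{\mu_i - l_i - 1}} \\
&= \frac{\Pr[(\mu_i - 1, l_i + h_i - \mu_i + 1) \mid 2i+1]}{2 \Pr[(\mu_i - 1, l_i + h_i - \mu_i + 1) \mid 2i]} \cdot \left(1 - \frac{1/k}{p + 1/k}\right)^{d_i + 1} \left(1 + \frac{1/k}{q}\right)^{-(d_i + 1)} \\
&\ge (1/2) \cdot (1 - 4/k)^{2 d_i + 2} \ge e^{-(5/k)(2 d_i + 2)}/2 \ge e^{-10 d_i/k}/8.
\end{aligned}$$

The bound of the first factor by $1/2$ follows from the fact that the breakdown $(\mu_i - 1, l_i + h_i - \mu_i + 1)$
is more likely under the selection of $2i + 1$ than under the selection of $2i$. □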



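Putting Claims 9 and 10 together we see that assuming the probability that the output $S$
is correct is $> 0.5$, we must have

$$\sum_{i=1}^{k/2} e^{-10 d_i/k} < 8. \qquad (3)$$

Claim 11. Equation (3) implies $\sum_{i=1}^{k/2} d_i > \frac{k^2}{80} \ln k$, for $k > 40$.

Proof. Denote $a_i := e^{-10 d_i/k}$, and let $f(x) := -\ln x$. The function $f(x)$ is convex, and thus
we have

$$\frac{20}{k^2} \sum_{i=1}^{k/2} d_i = \frac{2}{k} \sum_{i=1}^{k/2} f(a_i) \ge f\left(\frac{2}{k} \sum_{i=1}^{k/2} a_i\right) > f\left(\frac{16}{k}\right) = \ln \frac{k}{16} > \frac{\ln k}{4}$$

since $k > 40$. This implies the claim. □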

To finish the proof let $D_t$ denote the random variable representing the value of $\sum_{i=1}^{k/2} d_i$
after $t$ queries. Let $Z_t = D_t - \frac{t}{k}$. At each time step, a query to $Y = 2i$ will on average not
change $d_i$ if the element from $B_i$ is not selected for comparison with $Y$. If it is selected, it
will change $d_i$ by at most 1. Thus, on average, $D_t$ only grows by at most $\frac{1}{k}$ after each time
step. Thus $Z_t$ is a supermartingale. Let $T$ be the random variable representing the time at
which we stop and output $S$. By the optional stopping theorem, we have $E[Z_T] \le 0$,
which implies $E[T] \ge k \cdot E[D_T]$.

If our overall success probability is $> 0.75$, it must be the case that with probability
$> 1/2$ the probability of the output $S$ being correct conditioned on the observed $\{(l_i, h_i)\}_{i=1}^{k/2}$
is $> 1/2$. Thus by Claims 9, 10 and 11 we have $D_T > \frac{k^2}{80} \ln k$ with probability $> 1/2$. Thus,

$$E[T] \ge k \cdot E[D_T] \ge k \cdot \frac{1}{2} \cdot \frac{k^2}{80} \ln k = \Omega(k^3 \log k),$$

completing the proof of the lower bound.
Remark 12. The proof in the regime $k > n - 2$ is very similar. The only difference is that
there are $n/2$ bins now, and we'd get $E[D_T] = \Omega(k n \log n)$ instead of $\Omega(k^2 \log n)$, and thus
$E[T] = \Omega(k^2 n \log n)$.

We will now study the case where $k \le n$.

4 Optimal upper bounds

As discussed in the previous sections, it is not immediately clear how to make use of 
the information gained from queries because we do not know which of the k integers the 
information corresponds to. In this section, we present an algorithm for solving this problem. 
The algorithm is optimal when the probability of error required is constant (which means 
its worst case running time matches the lower bound). Our algorithm finds each of the k 
numbers individually, without attempting to use information gained when finding one integer 
to find another integer. We first introduce a concept we will use in all our algorithms: 

Definition 13. The k-position of an integer y is the number of integers in S that have a
value less than or equal to y.

The general technique of the algorithms is to do a binary search for a chosen integer, but
repeat each query of the binary search enough times to know the k-position of the queried
integer. A straightforward application of binary search with repeated queries would take
$\Omega(k^2 \log^2 n)$ queries to find the k-position of a number, even with a constant error probability.
We essentially use the noisy binary search technique of Feige et al. [3] to attain the optimal
query complexity. We start with the following simple lemma:

Lemma 14. We can find the k-position of integer $y$ by making $2k^2 \log \frac{2}{\delta}$ queries with the
probability of being correct being at least $1 - \delta$.

Proof. Let $K_y$ be the k-position of $y$. We do $m$ queries of $y$ to find $K_y$. For each query $Q_i$,
the probability of a response being '<' or '>' is given simply in terms of $K_y$:

$$\Pr[Q_i = \text{'<'}] = \frac{K_y}{k}, \qquad \Pr[Q_i = \text{'>'}] = 1 - \frac{K_y}{k},$$

because $K_y$ is the number of integers in $S$ less than or equal to $y$ and each such integer is
chosen as the $X_i$ for a query with equal probability. We use the analogy that the random
variable $Q_i$ is a coin with probability of heads (which represents '<') being $p = \frac{K_y}{k}$. Given $m$
tosses of the coin, of which $x$ are heads, we can approximate $p$ as: $\hat{p} = \frac{x}{m}$. We need to find the
relation between the number of tosses $m$ and the probability of error in this approximation.
Using standard concentration bounds [8], we see that $m \ge \frac{1}{2\varepsilon^2} \log \frac{2}{\delta}$ coin tosses are needed to
guarantee that $|\hat{p} - p| < \varepsilon$ with error at most $\delta$ (where $\varepsilon > 0$).

We need to decide on a value for $\varepsilon$. Note that $K_y$ is an integer in the range $[0, k]$ and
therefore, $p$ can only take on the values $0, \frac{1}{k}, \frac{2}{k}, \dots, \frac{k}{k}$. Thus, we need $\varepsilon \le \frac{1}{2k}$ so then we can
always round $\hat{p}$ to the closest $\frac{i}{k}$, where $i \in \mathbb{Z}$ and $0 \le i \le k$. Using this in the results from
[8], we see that $m = 2k^2 \log \frac{2}{\delta}$ coin tosses are enough to guarantee that we know the correct
value of $p$ with probability of error being at most $\delta$. Given $p$, we have $K_y = kp$ so we have
the k-position of $y$. □
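The estimator in the proof of Lemma 14 can be sketched as follows. This is only an illustration under stated assumptions: the helper `make_oracle` and the function names are introduced here, and the logarithm is taken to be the natural logarithm, which is what a Hoeffding-type bound $2e^{-2m\varepsilon^2} \le \delta$ with $\varepsilon = \frac{1}{2k}$ gives.

```python
import math
import random


def make_oracle(hidden, seed=None):
    """Hypothetical query oracle: compares y with a uniformly chosen element of S."""
    rng = random.Random(seed)
    return lambda y: '<' if rng.choice(hidden) <= y else '>'


def repetitions(k, delta):
    """Number of repeated queries m = 2 k^2 ln(2 / delta) used in Lemma 14."""
    return math.ceil(2 * k * k * math.log(2 / delta))


def estimate_k_position(query, y, k, delta):
    """Estimate the k-position of y, correct with probability at least 1 - delta."""
    m = repetitions(k, delta)
    heads = sum(1 for _ in range(m) if query(y) == '<')   # number of '<' answers
    p_hat = heads / m                                     # estimate of K_y / k
    return round(k * p_hat)                               # round to the nearest i / k
```

For `S = [3, 10]` and `k = 2`, querying `y = 8` gives the k-position 1 except with probability at most `delta`, while `y = 16` and `y = 2` give 2 and 0 with certainty because every answer is the same.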

We note that this immediately lets us solve the problem for $k > n$:

Corollary 15. When $k > n$, there is an $O(k^2 n \log n)$ algorithm to find all $k$ integers in $S$
with probability $1 - n^{-c}$ for all constant $c > 0$.

Proof. We find the k-position of all $n$ integers in the range $[1, n]$. Given the k-position of all
$n$ integers, we know how many of the $k$ numbers have each integer value. If the k-position of
$Y - 1$ is $i$ and the k-position of $Y$ is $i + j$, we know there are $j$ of the chosen numbers with
the value $Y$ (for $1 < Y \le n$. For $Y = 1$, we know the number of the chosen integers with this
value is equal to the k-position of $Y$).

To find the k-position of an integer with probability of error at most $\delta$, we need to perform
$O(k^2 \log \frac{1}{\delta})$ queries. If we want the probability of error of the algorithm to be at most $n^{-c}$, we
need the probability of error of finding the k-position of each integer to be at most $\delta = n^{-(c+1)}$
so that applying a union bound gives a total probability of error $\le n^{-c}$ (since we find the
k-position of $n$ integers). Thus, to find the k-position of each integer we need to perform
$O(k^2 \log n^{c+1}) = O(k^2 \log n)$ queries. Since we do this for $n$ integers, the total number of
queries we make is: $O(k^2 n \log n)$. □
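The reconstruction step in the proof of Corollary 15, which turns the k-positions of $1, \dots, n$ into the multi-set $S$, can be written out directly. The function name and the list representation are assumptions made for this sketch.

```python
def multiset_from_k_positions(k_positions):
    """Rebuild the sorted multiset S from the k-positions of 1, 2, ..., n.

    k_positions[y - 1] is the k-position of y; the number of copies of y in S
    is the difference between the k-positions of y and y - 1 (taking the
    k-position of 0 to be 0).
    """
    result = []
    previous = 0
    for y, position in enumerate(k_positions, start=1):
        count = position - previous
        if count < 0:
            raise ValueError("k-positions must be non-decreasing")
        result.extend([y] * count)
        previous = position
    return result
```

For instance, with $n = 4$ the k-positions `[2, 2, 3, 3]` correspond to the multi-set `[1, 1, 3]`.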

We provide an example to illustrate an approach using what we have so far. Suppose
$n = 16$, $k = 2$, $S = \{3, 10\}$ and let $m = 2k^2 \log \frac{2}{\delta} = 8 \log \frac{2}{\delta}$. We want to find the lowest of the
$k$ numbers first. We do a binary search where we repeat each query $m$ times. Our decision at
each stage of the binary search is determined by the k-position found for the number of that
stage. Therefore, we first do $m$ queries of $y = 8$. From this, we calculate $K_y = 1$ (note the
probability of error for this statement is $\delta$). So there is one of the $k$ numbers below or equal
to 8. Next, we do $m$ queries of $y = 4$ and find again that $K_y = 1$. When we do $m$ queries of
$y = 2$, we find that $K_y = 0$. This tells us that none of the $k$ numbers are below or equal to 2.
Therefore, we do $m$ queries of $y = 3$ and find that $K_y = 1$. If one of the $k$ numbers is less
than or equal to 3, but none of them are less than or equal to 2, we conclude that one of the
$k$ numbers is 3. We then repeat the same process to find the second of the $k$ numbers.
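The example above can be turned into a short sketch of the repeated-query binary search. It is an illustration only: the helper names, the use of the natural logarithm in $m$, and the way the search interval is halved are assumptions made here, chosen so that for $n = 16$ the queried values are 8, 4, 2, 3 as in the example.

```python
import math
import random


def make_oracle(hidden, seed=None):
    """Hypothetical query oracle: compares y with a uniformly chosen element of S."""
    rng = random.Random(seed)
    return lambda y: '<' if rng.choice(hidden) <= y else '>'


def k_position(query, y, k, delta):
    """Estimate the k-position of y from m = 2 k^2 ln(2 / delta) repeated queries."""
    m = math.ceil(2 * k * k * math.log(2 / delta))
    heads = sum(1 for _ in range(m) if query(y) == '<')
    return round(k * heads / m)


def find_tth_smallest(query, n, k, t, delta):
    """Binary search for the smallest y whose estimated k-position is at least t."""
    low, high = 1, n
    while low < high:
        mid = (low + high) // 2
        if k_position(query, mid, k, delta) >= t:
            high = mid        # the t-th smallest chosen integer is <= mid
        else:
            low = mid + 1     # fewer than t chosen integers are <= mid
    return low
```

Each of the roughly $\log n$ stages can fail independently with probability up to $\delta$, which is exactly the weakness of this simple approach: $\delta$ has to shrink as $n$ and $k$ grow.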

However, this approach is problematic because of the constant error each time we find the
k-position of a number. This flaw is mentioned for a similar algorithm in [1]. The number
of queries we make is $O(mk \log n) = O(k^3 \log n \log \frac{2}{\delta})$. Each group of queries of the same $y$
($m$ of them) gives the wrong result with probability $\delta$. Applying a union bound, our overall
probability of error ($\Delta$) is $\Delta = k \log(n)\, \delta$. If we want $\Delta$ to be a constant, we need $\delta = \frac{\Delta}{k \log n}$
and thus, the number of queries we make is actually $O(k^3 \log(n) \log(2k \log n))$.

To alleviate this problem, we model our algorithm as a random walk on a tree. In using
this technique, we follow [3]. In [3], the random walk approach is taken to do a noisy binary
search. We use this technique to find each of the chosen $k$ integers, although each step of
the random walk is modified to accommodate our lack of information about which of the
$k$ integers was chosen in a particular query. We use a binary tree where the leaves are (in
order) the integers $1, 2, \dots, n$. The internal nodes represent intervals that are the union of
the leaves in their subtrees. For example, the root node has the interval $[1, n]$ and the left
child of the root has the interval $[1, \frac{n}{2}]$. The tree height is $\log n$. Finally, we extend this
tree by adding chains of length $m' = O(\log n)$ to each of the leaf nodes, where the nodes in
these chains have the same value as the leaf they are attached to. An example tree with
$n = 4$ is shown in Figure 2 below.



Figure 2: Tree for the random walk with n = 4



4.1 Algorithm 

We discuss an algorithm for finding the $t$-th of the $k$ chosen integers. This algorithm is repeated
$k$ times (once for each of the $k$ numbers). Starting at the root, for each node $v$ we take the
following two steps:
1. We first check whether the $t$-th chosen integer is in the range of the node (call it $[a, b]$).
To do this, we find the k-position of $a - 1$ and $b$ by doing $8k^2$ queries of each of them.
If we find that the k-position of $a - 1$ is at most $t - 1$ and the k-position of $b$ is at least
$t$, then the $t$-th number lies in the range $[a, b]$. Otherwise, we backtrack up the tree to
the parent node of $v$.

2. If, according to the first step, the $t$-th number lies in the range $[a, b]$, we do $10k^2$ queries
of the middle value of the range of the node (call this $u$ where $u = \lfloor \frac{a+b}{2} \rfloor$). If $v$ is not a
leaf (or on a leaf chain) and the k-position of $u$ is at most $t - 1$, we choose the right
child of $v$. If the k-position of $u$ is at least $t$, we choose the left child of $v$. If $v$ is a leaf
(or on a leaf chain), we go down the chain further regardless of the result of the queries.

Note that there is a constant probability of error each time we determine the $k$-position
of an integer. This leads to a constant probability of choosing the wrong node to go to next.
We will analyze this probability shortly.

The algorithm walks for $m = O(\log n)$ steps and then stops, where $m < m'$. If it stops on
an internal node, the algorithm has failed. If it stops on one of the leaf chains (or a leaf node),
it outputs the value of the leaf (i.e. declares this value to be the value of the $t$-th of the $k$
numbers).
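The walk described above can be simulated in a few lines of Python. This is a minimal sketch rather than the authors' implementation, and it rests on assumptions that are restated here because their definitions lie outside this section: a query of $q$ is assumed to report whether a uniformly random element of $S$ is at most $q$, and the $k$-position of $q$ is assumed to be the number of elements of $S$ that are at most $q$. The names `make_oracle`, `k_position`, `find_t_th` and `find_all` are hypothetical, the base-2 logarithm in the step count is an assumption, and answering positions outside $[1, n-1]$ without querying is a shortcut of the sketch.

```python
import math
import random


def make_oracle(secret, rng):
    """Return query(q) -> bool, comparing q to a uniformly random hidden element."""
    items = list(secret)

    def query(q):
        return rng.choice(items) <= q  # assumed response model

    return query


def k_position(query, q, k, n, reps):
    """Estimate how many hidden integers are <= q from `reps` noisy answers."""
    if q < 1:
        return 0  # shortcut: nothing lies below the range [1, n]
    if q >= n:
        return k  # shortcut: everything lies at or below n
    hits = sum(query(q) for _ in range(reps))
    return int(k * hits / reps + 0.5)  # round to the nearest integer


def find_t_th(query, t, k, n, steps):
    """Walk the interval tree for `steps` moves; return the leaf value or None."""
    path = [(1, n)]  # stack of intervals from the root to the current node
    for _ in range(steps):
        a, b = path[-1]
        # Step 1: does [a, b] still appear to contain the t-th integer?
        inside = (k_position(query, a - 1, k, n, 8 * k * k) <= t - 1
                  and k_position(query, b, k, n, 8 * k * k) >= t)
        if not inside:
            if len(path) > 1:
                path.pop()  # backtrack to the parent
            continue
        if a == b:
            # Leaf or leaf chain: descend regardless of any midpoint answer,
            # so this sketch skips the midpoint queries here.
            path.append((a, b))
            continue
        # Step 2: compare against the midpoint and pick a child.
        u = (a + b) // 2
        if k_position(query, u, k, n, 10 * k * k) <= t - 1:
            path.append((u + 1, b))
        else:
            path.append((a, u))
    a, b = path[-1]
    return a if a == b else None  # stopping on an internal node is a failure


def find_all(secret, n, seed=0):
    """Recover all hidden integers, one random walk per rank t = 1..k."""
    rng = random.Random(seed)
    query = make_oracle(secret, rng)
    k = len(secret)
    steps = 70 * math.ceil(math.log2(n))  # step count borrowed from the proof of Theorem 16
    return [find_t_th(query, t, k, n, steps) for t in range(1, k + 1)]


if __name__ == "__main__":
    print(find_all([3, 7, 12], 16))
```

Keeping the path as a stack makes backtracking a single `pop`, and the unbounded stack plays the role of a leaf chain that is longer than the walk.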

The following theorem summarizes our results: 

Theorem 16. Our algorithm finds all $k$ integers in $S$ in $O\left(k^3 \log \frac{n}{\delta}\right)$ time with probability
of error at most $\delta$ for $k < n$.

To reach this theorem, we use the following lemma: 

Lemma 17. The algorithm finds the correct $t$-th integer in $S$ with the probability of error
being at most $e^{-m/35}$, where $m$ is the number of steps in the random walk.

Proof. We need to prove that the algorithm's position on the walk after m steps is the correct 
leaf chain with high probability. Orient all edges of the tree so they are directed towards 
the correct leaf chain (and within this leaf chain they are directed down). We can do this 
because the graph is a tree (there is only one path between every two vertices) and there 
is only one correct leaf. We can now consider the algorithm's position in the tree as a one 
dimensional random walk. We let the starting point of the walk be $0$ (the root of the tree),
the correct leaf be $R$ steps to the right and any of the wrong leaves be $R$ steps to the left.
Note that $R = \log n$ (height of the tree).

We need to find the probabilities of moving left and right in the random walk. We will 
show that the probability of moving in the correct direction (to the right) is at least 0.7 
at every node. Furthermore, note that the decision made at any node is independent of 
the previous steps in the random walk. Let q be the probability of going left at any move. 
This is equivalent to the probability of going along the wrong direction of an edge, which is 
equivalent to making a mistake somewhere in choosing the next vertex. The probability of 
incorrectly calculating whether the $t$-th number is in the range $[a, b]$ is at most the probability
that we incorrectly calculate the $k$-position of either $a - 1$ or $b$. Since we do $8k^2$ queries
of each, by Lemma 14 we know that the probability of error in calculating the $k$-position
of each is $\delta$ where $2 \log \frac{2}{\delta} = 8 \Rightarrow \delta = \frac{1}{8}$. So the probability of incorrectly calculating the
$k$-position of either $a - 1$ or $b$ is at most $1 - \left(\frac{7}{8}\right)^2 = \frac{15}{64}$. Similarly, we do $10k^2$ queries of $u$, so
the probability of error is $\delta$ where $2 \log \frac{2}{\delta} = 10 \Rightarrow \delta = \frac{1}{16}$. Thus, the total probability of error
at each node is at most $\frac{15}{64} + \frac{1}{16} = \frac{19}{64} < 0.3$. Therefore, $q < 0.3$ and $p > 0.7$, where $p$ is the probability of
going to the right (i.e. the correct direction). Figure 3 illustrates the random walk space.



[Figure: a number line with the incorrect leaves at $-R$, the start at $0$, and the correct leaf at $R = \log n$; a leaf branch of length $O(\log n)$ extends beyond each end.]

Figure 3: The random walk space

For the algorithm to be correct, it must be on or to the right of $R$ after $m$ steps (so it
returns the correct integer); otherwise it is wrong. Let $X$ be the random variable denoting
the number of moves to the right made after $m$ moves. Then $m - X$ is the number of moves
to the left. Therefore, the algorithm is correct if $X - (m - X) = 2X - m \ge R$. This is
equivalent to the condition that $X \ge \frac{m+R}{2}$. Then the probability that the algorithm is correct
is $\Pr\left[X \ge \frac{m+R}{2}\right] = 1 - \Pr\left[X < \frac{m+R}{2}\right]$, and $\Pr\left[X < \frac{m+R}{2}\right]$ is the probability of error we want
to bound. To find $E[X]$, let $X_i$ be an indicator random variable that is 1 if the algorithm
moves to the right on the $i$-th move and 0 otherwise. Note that $\Pr[X_i = 1] = p \Rightarrow E[X_i] = p$.
Therefore, $E[X] = E[X_1] + E[X_2] + \ldots + E[X_m] = pm$ by linearity of expectation. We want
to use a Chernoff bound to bound the probability of error, so we need to find a $\delta$ such that:

$$\frac{m + R}{2} = (1 - \delta)pm \;\Rightarrow\; 1 - \delta = \frac{m + R}{2pm} \;\Rightarrow\; \delta = \frac{2pm - m - R}{2pm}$$

Note that $0 < \delta < 1$ because $0 < 2pm - m - R < 2pm$. Since each step of the random walk
is independent of the other steps (i.e. $X_i$ is independent of $X_j$ for $i \neq j$), we can use the
Chernoff bound ([B]):

$$\Pr\left[X < \frac{m + R}{2}\right] \le e^{-\frac{\delta^2 E[X]}{2}} = e^{-\frac{1}{2}\left(\frac{2pm - m - R}{2pm}\right)^2 pm} = e^{-\frac{(2pm - m - R)^2}{8pm}}$$



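Recall that $p \ge 0.7$ and set $m = x \log n$, where $x$ is a constant. Then $\frac{(2pm - m - R)^2}{8pm} \ge \frac{(0.4x - 1)^2}{5.6x} \log n$.
We want to write this as $\frac{m}{d}$ where $d$ is a constant. Then $d = \frac{5.6x^2}{(0.4x - 1)^2}$. Note
that as $x$ increases, $d$ decreases to some asymptotic value:

$$\lim_{x \to \infty} \frac{5.6x^2}{(0.4x - 1)^2} = \lim_{x \to \infty} \frac{5.6}{\left(0.4 - \frac{1}{x}\right)^2} = \frac{5.6}{0.4^2} = 35$$

Then we have that $\frac{(2pm - m - R)^2}{8pm} \ge \frac{m}{35}$. Therefore,

$$\Pr\left[X < \frac{m + R}{2}\right] \le e^{-m/35}.$$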
Thus, we have bounded the probability of error as required. □ 
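The constants in this proof can be checked numerically. The sketch below assumes $p = 0.7$ exactly, takes $R = 5$ (for instance $n = 32$ with a base-2 logarithm, which is an assumption), and uses hypothetical function names. With $m = x \log n$, the Chernoff exponent $\frac{(2pm - m - R)^2}{8pm}$ equals $\frac{m}{d}$ for $d(x) = \frac{5.6x^2}{(0.4x - 1)^2}$; this $d(x)$ approaches 35 from above and is about 37.6 at $x = 70$. The sketch also compares the exact binomial tail with the Chernoff bound.

```python
import math


def chernoff_exponent(m, R, p=0.7):
    """Exponent (2pm - m - R)^2 / (8pm) appearing in the Chernoff bound."""
    return (2 * p * m - m - R) ** 2 / (8 * p * m)


def d_constant(x):
    """d(x) = 5.6 x^2 / (0.4 x - 1)^2, so the exponent equals m / d when p = 0.7."""
    return 5.6 * x * x / (0.4 * x - 1) ** 2


def exact_tail(m, R, p=0.7):
    """Exact Pr[X < (m + R) / 2] for X ~ Binomial(m, p), with integer m and R."""
    upper = (m + R + 1) // 2 - 1  # largest integer strictly below (m + R) / 2
    return sum(math.comb(m, i) * p**i * (1 - p) ** (m - i)
               for i in range(upper + 1))


if __name__ == "__main__":
    R, x = 5, 70
    m = x * R
    print("d(x)            :", d_constant(x))
    print("Chernoff bound  :", math.exp(-chernoff_exponent(m, R)))
    print("exact tail      :", exact_tail(m, R))
    print("d for large x   :", d_constant(1e9))
```

In this instance the exact tail is many orders of magnitude smaller than the Chernoff bound, which is expected because the bound is not tight.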



We apply Lemma 17 to prove the bound on the full algorithm. Even though our lower
bound works when the error probability is constant, the algorithm applies even when the
error is very small ($n^{-\Omega(1)}$). We are now ready to present the proof for Theorem 16.
Proof. We prove separately the cases when $\delta \ge \frac{1}{n}$ and when $\delta < \frac{1}{n}$. In the first case, we set
$m = 70 \log n$. By Lemma 17, the probability of not finding the correct $t$-th number is at most
$e^{-m/35} = e^{-2 \log n} \le \frac{1}{n^2}$. Applying a union bound of this over the $k$ numbers we need to
find, the probability of error is at most $\Delta \le \frac{k}{n^2} < \frac{1}{n}$ because $k < n$. Since $\frac{1}{n} \le \delta$,
the probability of error is bounded as required. So we need in total $70k \log n$ steps of the
random walk algorithm. Recall that each such step takes $O(k^2)$ queries. Therefore, in total,
we have a running time of $O(70k^3 \log n) = O(k^3 \log n) = O\left(k^3 \log \frac{n}{\delta}\right)$ since $\delta < 1$.

We now consider the case when $\delta < \frac{1}{n}$. Set $m = 70 \log \frac{1}{\delta}$. The probability of not finding
the correct $t$-th number is at most $e^{-m/35} = e^{-2 \log \frac{1}{\delta}} \le \delta^2$ by Lemma 17 (and that $\delta < 1$).
Applying a union bound over the $k$ numbers we need to find, the overall probability of error is
at most $k\delta^2 < n\delta^2 < \delta$ as required. Thus, we need $O\left(k \log \frac{1}{\delta}\right)$ steps in the random walk, where each
consists of $O(k^2)$ queries. Therefore, the total running time is $O\left(k^3 \log \frac{1}{\delta}\right) = O\left(k^3 \log \frac{n}{\delta}\right)$. □
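As a worked instance of the first case, the constants stated in the algorithm give an explicit upper bound on the number of queries. This is a back-of-the-envelope count under the assumption that every step performs all three groups of queries (a step that backtracks performs only the first two):

```latex
\[
\underbrace{k}_{\text{integers}} \cdot
\underbrace{70 \log n}_{\text{steps per integer}} \cdot
\underbrace{\left(8k^2 + 8k^2 + 10k^2\right)}_{\text{queries per step}}
= 1820\, k^3 \log n = O\!\left(k^3 \log n\right).
\]
```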